Integrating LLMs into Simple Apps
Learn how to integrate Large Language Models (LLMs) into your applications in 15 minutes! 🚀
🎯 What We'll Build
A simple Python application that:
- Connects to an LLM API (OpenAI/local model)
- Sends prompts and receives responses
- Handles errors and retries
- Streams responses for better UX
📋 Prerequisites
# Install required packages
pip install openai python-dotenv
What you need:
- Python 3.8+
- API key from OpenAI (or use a local model like Ollama)
- Basic understanding of Python
🔑 The Fundamentals
What is an LLM?
Large Language Model - An AI model trained on massive amounts of text that can:
- Understand and generate human-like text
- Answer questions
- Summarize content
- Write code
- Translate languages
- And much more!
Popular LLM Providers
| Provider | Model Examples | Best For |
|---|---|---|
| OpenAI | GPT-4, GPT-3.5-turbo | Production apps, high quality |
| Anthropic | Claude 3 | Long context, safety |
| Local (Ollama) | Llama 2, Mistral | Privacy, no API costs |
| Google | Gemini 1.5 Pro, Gemini 1.5 Flash | Multimodal tasks |
| Azure OpenAI | GPT-4 | Enterprise, compliance |
🏗️ Basic Integration
1. Setup Environment Variables
Create a .env file:
# .env
OPENAI_API_KEY=sk-your-api-key-here
2. Simple Chat Completion
# simple_llm.py
import os
from openai import OpenAI
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
# Initialize OpenAI client
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
def chat(prompt: str, model: str = "gpt-3.5-turbo") -> str:
"""
Send a prompt to the LLM and get a response
Args:
prompt: The user's message
model: Which model to use
Returns:
The LLM's response as a string
"""
try:
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt}
],
temperature=0.7, # Controls randomness (0-2)
max_tokens=500, # Maximum response length
)
return response.choices[0].message.content
except Exception as e:
return f"Error: {str(e)}"
# Test it
if __name__ == "__main__":
result = chat("Explain what an API is in one sentence.")
print(result)
Output:
An API (Application Programming Interface) is a set of rules and protocols
that allows different software applications to communicate with each other.
💬 Conversation with Context
LLMs can maintain context across multiple messages:
# conversation.py
from openai import OpenAI
from dotenv import load_dotenv
import os
load_dotenv()
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
class ChatBot:
def __init__(self, system_prompt: str = "You are a helpful assistant."):
"""Initialize chatbot with conversation history"""
self.messages = [
{"role": "system", "content": system_prompt}
]
def chat(self, user_message: str) -> str:
"""Send a message and get response, maintaining context"""
# Add user message to history
self.messages.append({"role": "user", "content": user_message})
# Get response from LLM
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=self.messages,
temperature=0.7
)
# Add assistant response to history
assistant_message = response.choices[0].message.content
self.messages.append({"role": "assistant", "content": assistant_message})
return assistant_message
def reset(self):
"""Clear conversation history"""
self.messages = self.messages[:1] # Keep only system message
# Example usage
if __name__ == "__main__":
bot = ChatBot(system_prompt="You are a Python expert assistant.")
# Multi-turn conversation
print("Bot:", bot.chat("What is a list comprehension?"))
print("\nBot:", bot.chat("Can you show me an example?"))
print("\nBot:", bot.chat("What about with a condition?"))
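Note that self.messages grows with every turn, and the whole history is re-sent on each request, so long conversations get slower, more expensive, and can eventually exceed the model's context window. A minimal sketch of one way to cap it (the max_turns value and trimming strategy here are illustrative assumptions, not part of the ChatBot above):

```python
def trim_history(messages: list, max_turns: int = 10) -> list:
    """Keep the system prompt plus only the most recent exchanges.

    One turn is a user message plus an assistant reply, so we keep the
    first entry (the system prompt) and the last 2 * max_turns entries.
    """
    return messages[:1] + messages[1:][-2 * max_turns:]

# Example: call inside ChatBot.chat() before sending the request
# self.messages = trim_history(self.messages)
```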
🌊 Streaming Responses
For better UX, stream responses as they're generated:
# streaming.py
from openai import OpenAI
import os
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
def stream_chat(prompt: str):
"""Stream LLM response word by word"""
stream = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": prompt}],
stream=True # Enable streaming
)
print("Assistant: ", end="", flush=True)
for chunk in stream:
if chunk.choices[0].delta.content:
content = chunk.choices[0].delta.content
print(content, end="", flush=True)
print() # New line at end
# Test streaming
if __name__ == "__main__":
stream_chat("Write a haiku about programming.")
Output (appears word by word):
Assistant: Code flows like a stream,
Logic branches intertwine,
Bugs lurk in shadows.
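If you also need the full text once streaming finishes (for example, to append it to a conversation history), a small variant of stream_chat can accumulate the chunks while printing them. This sketch reuses the client from streaming.py:

```python
def stream_chat_collect(prompt: str) -> str:
    """Stream the response to the console and return the complete text."""
    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    parts = []
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
            parts.append(content)
    print()
    return "".join(parts)  # full response, ready to store or reuse
```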
🔄 Error Handling & Retries
Production-ready integration with retry logic:
# robust_llm.py
from openai import OpenAI
import os
import time
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
def chat_with_retry(
prompt: str,
max_retries: int = 3,
retry_delay: int = 2
) -> str:
"""
Chat with automatic retry on failure
Args:
prompt: User message
max_retries: Number of retry attempts
retry_delay: Seconds to wait between retries
"""
for attempt in range(max_retries):
try:
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": prompt}],
timeout=30 # 30 second timeout
)
return response.choices[0].message.content
except Exception as e:
print(f"Attempt {attempt + 1} failed: {str(e)}")
if attempt < max_retries - 1:
print(f"Retrying in {retry_delay} seconds...")
time.sleep(retry_delay)
else:
return f"Failed after {max_retries} attempts: {str(e)}"
# Test
if __name__ == "__main__":
result = chat_with_retry("What is machine learning?")
print(result)
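The fixed retry_delay above is fine for a demo, but retrying on every exception can hide real bugs. A common refinement is to retry only on transient errors and back off exponentially. Here is a sketch using the exception classes exposed by the openai package; treat it as an alternative pattern, not a drop-in replacement for chat_with_retry:

```python
import time
import openai

def chat_with_backoff(prompt: str, max_retries: int = 3) -> str:
    """Retry only transient failures, doubling the wait each time."""
    delay = 1  # seconds
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
                timeout=30
            )
            return response.choices[0].message.content
        except (openai.RateLimitError, openai.APITimeoutError,
                openai.APIConnectionError) as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)
            delay *= 2  # exponential backoff
```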
🎨 Using Local Models (Ollama)
Don't want to pay for APIs? Run models locally!
Install Ollama
# Install Ollama (Mac/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model
ollama pull llama2
Use with OpenAI-Compatible API
# local_llm.py
from openai import OpenAI
# Point to local Ollama server
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama' # Required but unused for Ollama
)
def chat_local(prompt: str) -> str:
"""Chat with local Ollama model"""
response = client.chat.completions.create(
model="llama2", # Use any installed Ollama model
messages=[{"role": "user", "content": prompt}]
)
return response.choices[0].message.content
# Test
if __name__ == "__main__":
result = chat_local("What are the benefits of local LLMs?")
print(result)
Benefits of Local Models:
- ✅ No API costs
- ✅ Complete privacy
- ✅ No rate limits
- ✅ Works offline
- ❌ Slower than cloud
- ❌ Requires good hardware
🎛️ Key Parameters Explained
Temperature (0-2)
Controls randomness in responses:
- 0.0 - Deterministic, focused, consistent
- 0.7 - Balanced, a good general-purpose setting
- 1.5+ - Creative, varied, unpredictable
# Deterministic (same answer every time)
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": "What is 2+2?"}],
temperature=0
)
# Creative (different each time)
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": "Write a creative story opening."}],
temperature=1.5
)
Max Tokens
Limits response length:
- 1 token ≈ 0.75 words
- 100 tokens ≈ 75 words
- 1000 tokens ≈ 750 words
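For example, to keep an answer to roughly a paragraph, cap max_tokens at around 200. A small sketch reusing the client from the earlier examples (the prompt is just an illustration):

```python
# Cap the response at ~150 words (about 200 tokens)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize what an LLM is."}],
    max_tokens=200
)
print(response.choices[0].message.content)
# If the reply was cut off by the limit, finish_reason will be "length"
print(response.choices[0].finish_reason)
```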
Top P (0-1)
Alternative to temperature for controlling randomness:
- Uses nucleus sampling
- 0.1 = only top 10% likely tokens
- 1.0 = all tokens considered
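Usage looks just like temperature; the usual advice is to tune one or the other, not both. A small sketch with the same client (the prompt is illustrative):

```python
# Conservative sampling: only tokens in the top 10% of probability mass
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Name a programming language."}],
    top_p=0.1
)
print(response.choices[0].message.content)
```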
💰 Cost Management
OpenAI pricing (as of 2024):
- GPT-3.5-turbo: $0.50 / 1M input tokens, $1.50 / 1M output tokens
- GPT-4-turbo: $10 / 1M input tokens, $30 / 1M output tokens
Tips to reduce costs:
- Use GPT-3.5-turbo for simple tasks
- Limit max_tokens to prevent long responses
- Cache common responses (see the caching sketch after the cost example below)
- Use local models when possible
- Implement rate limiting
# Cost tracking example
def track_costs(input_tokens: int, output_tokens: int, model: str = "gpt-3.5-turbo"):
"""Calculate API call cost"""
prices = {
"gpt-3.5-turbo": {"input": 0.50, "output": 1.50},
"gpt-4-turbo": {"input": 10.0, "output": 30.0}
}
cost = (
(input_tokens / 1_000_000) * prices[model]["input"] +
(output_tokens / 1_000_000) * prices[model]["output"]
)
return f"${cost:.6f}"
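For the caching tip above, here is a minimal in-memory sketch using functools.lru_cache. It reuses the client from the earlier examples, and only exact-match prompts hit the cache:

```python
# cached_llm.py - illustrative caching sketch (in-memory, exact-match only)
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_chat(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Return a cached answer for repeated identical prompts."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Second call with the same prompt is served from the cache at no cost
print(cached_chat("What is an API?"))
print(cached_chat("What is an API?"))
```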
🎯 Best Practices
| Practice | Why |
|---|---|
| ✅ Use environment variables | Keep API keys secure |
| ✅ Implement timeouts | Prevent hanging requests |
| ✅ Add retry logic | Handle transient failures |
| ✅ Stream responses | Better user experience |
| ✅ Set max_tokens | Control costs and response length |
| ✅ Validate inputs | Prevent prompt injection (see the sketch below) |
| ✅ Cache common queries | Reduce API calls |
| ✅ Monitor usage | Track costs and errors |
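For the "validate inputs" row, here is a lightweight sketch of pre-checks to run before user text reaches the model. The length limit and blocked phrase are illustrative assumptions, not a complete defense against prompt injection:

```python
MAX_PROMPT_CHARS = 2000  # illustrative limit - tune for your app

def validate_prompt(user_input: str) -> str:
    """Basic sanity checks before sending user text to the LLM."""
    text = user_input.strip()
    if not text:
        raise ValueError("Prompt is empty")
    if len(text) > MAX_PROMPT_CHARS:
        raise ValueError("Prompt is too long")
    # Flag obvious attempts to override the system prompt (not foolproof)
    if "ignore previous instructions" in text.lower():
        raise ValueError("Prompt looks like an injection attempt")
    return text
```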
🔗 Next Steps
Now that you can integrate LLMs:
- Build a chatbot with a web interface
- Try log summarization for a real-world use case
- Experiment with different models and parameters
📚 Useful Resources
| Resource | Description |
|---|---|
| OpenAI API Docs | Official OpenAI documentation |
| Ollama | Run LLMs locally |
| LangChain | Framework for LLM apps |
| OpenAI Cookbook | Code examples |
| Anthropic Claude | Alternative LLM provider |
You're now ready to integrate LLMs into any application! Start with simple prompts and gradually build more complex features. 🚀
Last updated: February 2026